INN Hotels Project

Context

A significant number of hotel bookings are called-off due to cancellations or no-shows. The typical reasons for cancellations include change of plans, scheduling conflicts, etc. This is often made easier by the option to do so free of charge or preferably at a low cost which is beneficial to hotel guests but it is a less desirable and possibly revenue-diminishing factor for hotels to deal with. Such losses are particularly high on last-minute cancellations.

The new technologies involving online booking channels have dramatically changed customers’ booking possibilities and behavior. This adds a further dimension to the challenge of how hotels handle cancellations, which are no longer limited to traditional booking and guest characteristics.

The cancellation of bookings impact a hotel on various fronts:

Objective

The increasing number of cancellations calls for a Machine Learning based solution that can help in predicting which booking is likely to be canceled. INN Hotels Group has a chain of hotels in Portugal, they are facing problems with the high number of booking cancellations and have reached out to your firm for data-driven solutions. You as a data scientist have to analyze the data provided to find which factors have a high influence on booking cancellations, build a predictive model that can predict which booking is going to be canceled in advance, and help in formulating profitable policies for cancellations and refunds.

Data Description

The data contains the different attributes of customers' booking details. The detailed data dictionary is given below.

Data Dictionary

Notes to reviewer

From time to time, you will find the %%time command included in some cells. It is there because I was interested in knowing the execution time of some cells, especially the cells that generated complex graphics. The command is not critical to the final outcome, but interesting to know.

For the Decision Tree section, I elected not to include the export.text option as I though it would add too much length to the notebook with limited value. I found the commmand tree.plot_tree to be more valuable.

I am not the world's greatest speller so there will be spelling mistakes. Jupyter really needs a spell checker.

Background Information

I did some checking on the internet and found the following URLs and hotel cancellation information from 2018:

https://blog.experience-hotel.com/where-do-cancellations-come-from/

https://www.hotelmanagement.net/tech/study-cancelation-rate-at-40-as-otas-push-free-change-policy

https://hospitalitytech.com/global-cancellation-rate-hotel-reservations-reaches-40-average

Key facts concerning hotel bookings:

  1. Online Distribution has grown by 46.7% between 2014 and 2018
  2. 71% of online distribution for independent hotels is generated by online travel agencies in 2018
  3. Booking Holdings (company holding booking.com) holds 68% of the OTA market share in 2018
  4. Wholesalers and bed banks have grown by over 100% in 5 years
  5. Website Direct remains the second most important sales channel, with 20.9% of market share
  6. While still the second best channel, Website Direct has lost 6.3% share in 5 years, which has been taken over by OTAs
  7. Almost 40% of on-the-books revenue is canceled before arrival in 2018
  8. Average Length-of-Stay has diminished by 12%
  9. After 4 years of negative trend, in 2018 the industry experienced an improvement in both length-of-stay and reservation value
  10. Reservations with lead times longer than 60 days are 65% more likely to be canceled

The web sites also stated it was common for guests to make reservations at multiple hotels and then whittle down the choices closer to the arrival date. Since there is no change for cancellations, it is easy for guests to do this.

Importing necessary libraries and data

Function Definitions

I am old fashioned and was originally trained in procedural code. I like defining functions at the top of the notebook.
I also find it is better to have everything defined upfront so you can define and load everything all at once. The flow of the code is also not interrupted with a function defintion.

Import Data

Data Overview

Observations:

No null or duplicated values

Observations:

Notes

Dropping Booking_ID column as it does not provide any value

Exploratory Data Analysis (EDA)

Leading Questions:

  1. What are the busiest months in the hotel?
  2. Which market segment do most of the guests come from?
  3. Hotel rates are dynamic and change according to demand and customer demographics. What are the differences in room prices in different market segments?
  4. What percentage of bookings are canceled?
  5. Repeating guests are the guests who stay in the hotel often and are important to brand equity. What percentage of repeating guests cancel?
  6. Many guests have special requirements when booking a hotel room. Do these requirements affect booking cancellation?

Univariate Analysis

Number of Adults

Observations:

Number of Children

Observations:

Observations:

Number of Weekend Nights

Observations:

Number of Week Nights

Observations:

Meal Plan

Observations:

Required Car Parking Space

Observations:

Room Type Reserved

Observations:

Lead Time for Reservation

Observations:

Arrival Year, Month, and Day

Observations:

Notes:

For the arrival month, I am assuming the numbers of 1 - 12 correspond to January - December with 1 being January and 12 being December.

Observations:

Observations:

Market Segment Type

Observations:

Repeated Guest

Observations:

Number of Previous Cancellations

Observations:

Number of Previous_bookings not Canceled

Observations:

Average Price per Room

Observations:

Number of Special Requests

Observations:

Booking Status

Observations:

Bivariate Analysis

Correlation

Observations:

Arrival Year and Arrival Month

This is plotting and analyzing the data over the entire 18 months

Observations:

Market Segment Type and Average Price per Room

Observations:

Market Segment Type and Booking Status

Observations:

Number of Special Requests and Booking Status

Observations:

Number of Special Requests and Average Price per Room

Observations:

Average Price per Room and Booking Status

Observations:

Lead Time and Booking Status

Observations:

Number of Adults and Booking Status

No children
2 or more adults

Observations:

Number of Family Members

Observations:

Length of Hotel Stay

Observations:

Note: pay attention to the legend, the colors have changed.

Repeat Guests and Booking Status

Observations:

Cancellations over Time

Notes:

For the following, I wanted to look at the reservations over the entire time period as there is a difference between the numbers for the months in 2017 as opposed to the same months in 2018.

The following code assigns a number to each year and month combination. 2017 and 7 is 1 and 2018 and 12 is 18.

There is probably more elegant way to write the following code, but it works.

Observations:

Note: The colors in the legend have changed.

Average Price per Room over Time

Observations:

Booking Status over time for Market Segment: Aviation

Observations:

Booking Status over time for Market Segment: Complementary

Observations:

Booking Status over time for Market Segment: Corporate

Observations:

Booking Status over time for Market Segment: Offline

Observations:

Booking Status over time for Market Segment: Online

Observations:

Note: Check the legend as the colors have changed

Data Preprocessing

Outlier Check

Outlier Treatment

I am not going to treat the outliers as it is inheritant in the data.
Treating outliers would have a big change on the data and results.

EDA

Data Preparation for Modeling

Split the data

Add intercept to independent variables

Create Dummy Variables

Split the independent data 70:30 into train and test

Checking Multicollinearity

Columns with a high VIF value are:

As the market_segment_type column is a category, the VIF values can be ignored.
This is documented in the notebook: IncomeGroupClassification_CaseStudy_DSBA.ipynb in the Multicollinearity section.

Dropping high p-value variables

Observations:

Converting coefficients to odds

Checking model performance on the training set

Observations:

ROC-AUC

Optimal threshold using AUC-ROC curve

Observations:

Checking model performance on training set

Observations:

Precision-Recall curve to see if there is a better threshold

Checking model performance on training set

Observations:

This is the model to use!

Check the performance on the test set

Using model with default threshold

ROC curve on test set

Observations:

The model with a threshold of 0.42 is the model to use!

Final Model Summary

Observations:

The model with a threshold of 0.42 is the model to use!

Building a Decision Tree model

Build Decision Tree Model

Check Performance on Training Set

Observations:

Of course, the metrics are all great since the decision tree is completely overfit

Check Performance on Test Set

Observations:

Overfit in the test data

Visualize Decision Tree

Observations:

Observations:

There are many columns that have little or no relative importance. They can be pruned.

Pruning the tree

Pre-Pruning using GridSearchCV

Check Performance on Training Set

Observations:

Check Performance on Test Set

Observations:

Visualize Decision Tree

Observations:

Observations:

Cost Complexity Pruning

Observations:

The alpha does not change that much until you get down to around 500 node and around 10 on the depth of the tree. Alpha then begins to flatten out.

F1 Score vs alpha for training and testing sets

Performance on Training Set

Observations:

The metrics are better for Cost Complexity as opposed to Pre-Purne and are better than the Baseline

Performance on Test Set

Observations:

The metrics are better for Cost Complexity as opposed to Pre-Purne and are better than the Baseline

Visualize Decision Tree

Observations:

Observations:

Model Performance Comparison and Conclusions

Actionable Insights

Recommendations

Future Recommendations